[Feature Request] Support Dynamic Slot Allocation in RawBlock Backend · Issue #3392 · LMCache/LMCache

Label
new feature

Is your feature request related to a problem? Please describe.

Summary

Introduce a dynamic slot allocation strategy for RawBlockCore to replace the fixed-slot allocator, minimizing internal fragmentation.

Motivation

The fixed-slot allocation model introduces internal fragmentation in several common scenarios.

The most common sources of this issue are:

Partial (unfull) chunks
Compression-enabled workloads

In all cases, payload sizes are often smaller than the configured slot size, leaving unused space within allocated slots and reducing effective storage utilization.

1. Partial (Unfull) Chunk Allocation

Workloads often produce payloads smaller than the configured slot size.
This occurs in scenarios such as:

Naturally incomplete (unfull) chunks when save_unfull_chunk=True
Prefill-Decode (PD) mode, where save_unfull_chunk is always enabled and uneven payload sizes are generated by design

Example

Fixed Slot (512 KiB)

[########################################################] 512 KiB slot
[##########..............................................] 50 KiB payload
[##########..............................................] 462 KiB unused space

Result:

A large portion of the slot remains unused
Fragmentation accumulates across workloads

2. Compression-Enabled Workloads

When compression is applied, the physical payload becomes smaller than the fixed slot size.

This is especially prominent in architectures such as DeepSeek-V4, where:

Different layer groups use different compression ratios
Physical chunk sizes vary across layers

Example

Slot (512 KiB)

Layer A (low compression)
[####################################....................]

Layer B (high compression)
[############............................................]

Layer C (very high compression)
[######..................................................]

Result:

Highly compressed layers leave large portions of slots unused
Fragmentation becomes uneven across layers
Storage efficiency degrades with layer heterogeneity

Please correct me if I'm wrong

Describe the solution you'd like
A clear and concise description of what you want to happen.

Proposed Change

Replace the fixed-slot allocator with a dynamic slot allocator:

Allocate slot size based on actual payload size
Continue to use a bump-pointer allocation strategy for fast sequential allocation when capacity is available
Use power-of-two size classes to bound allocation granularity and ensure predictable reuse patterns
Maintain a segmented free list for efficient reuse of freed regions
Provide size-aware allocation hints to avoid allocation–eviction mismatch

This design minimizes internal fragmentation and significantly improves effective storage utilization across heterogeneous workloads.

Expected Benefits

Reduced Fragmentation

Dynamic Allocation

[##########] 50 KiB slot
[##############] 80 KiB slot
[########################] 200 KiB slot

Minimal unused space per allocation
Improved storage utilization

Improved Efficiency

Increased effective storage capacity
Better reuse of freed regions
Reduced allocation failures
Eliminates unnecessary padding overhead from fixed-size allocation

Comments

daegyu94 · 2026-05-26

@DongDongJu
I’d appreciate your thoughts on this approach.
If this direction aligns with the overall design, I’d be happy to proceed with an implementation.

DongDongJu · 2026-05-26

This is always the problem of having fixed memory granularity.
I think we can have a linux OS + io_uring like behavior.
How about having fixed pre-allocated memory pool with different sizes like buddy allocator?
We can reuse it and even we can pin the memory to gpu side as well for improving the read performance.
WDYT?

daegyu94 · 2026-05-27

This is always the problem of having fixed memory granularity. I think we can have a linux OS + io_uring like behavior. How about having fixed pre-allocated memory pool with different sizes like buddy allocator? We can reuse it and even we can pin the memory to gpu side as well for improving the read performance. WDYT?

What I intended was slot management in a raw‑block backend, rather than the fixed buffer registration used in io_uring.

I agree that fragmentation is a classic issue in fixed memory management.

My idea is to keep the io_uring buffers fixed-size, while making the raw‑block space management dynamic instead of fixed-size.
I was also wondering whether the io_uring buffer size needs to match the raw-block allocation unit, since read/write operations can specify the actual length.

If I’m misunderstanding something here, I’d really appreciate your feedback.

DongDongJu · 2026-05-27

Yeah, exactly. What I meant was not that every allocation should be fully dynamic from raw memory.

I was thinking more like linux slab/slob/slub style allocation on top of a buddy-like memory pool:
we pre-allocate multiple fixed-size buffer classes, and then reuse them based on the requested payload size.

So instead of having only one fixed slot size, we can have several fixed-size pools, e.g. 64KB / 128KB / 256KB / 512KB, and allocate from the smallest class that can fit the payload.

This keeps allocation/reuse simple and predictable, while reducing the internal fragmentation caused by a single fixed slot size.

For io_uring, I think the registered buffers can still be fixed/pre-allocated. The raw-block allocator does not necessarily need to expose the same fixed size as the io_uring buffer, since read/write can still use the actual payload length.

And as you mentioned, if we manage these fixed-size pools carefully, we may also be able to pin/reuse them on the GPU side to improve read performance.

So maybe the design is:

pre-allocated memory pools with several fixed size classes

free list per size class

allocate from the best-fit size class

reuse freed buffers

optionally pin some pools for GPU-side read optimization

This is closer to what I intended than arbitrary variable-size allocation.

WDYT?

daegyu94 · 2026-05-28

Your design looks solid.

From my understanding, the slot allocator can be structured as:

Full chunks: fixed, model-dependent slot size

Partial chunks: size-class pools for smaller payloads

Based on this, we can consider several design choices.
Below are my thoughts:

Partial vs Full ratio tuning

For example: Partial 20%, Full 80%

The ratio should be configurable and tunable

Size classes in Partial

How many size classes should we have?

What should the range be? (e.g., 64 KB ~ 1024 KB)

Allocation fallback

Should we allow fallback from Partial to Full? If so, we need to protect Full capacity (e.g., via reservation)

Full → Partial fallback should be avoided due to the complexity of splitting and merging partial classes

Note: These choices will also impact eviction policy, especially under allocation pressure or failure scenarios

I opened this issue because I thought it would be worth discussing how we should handle partial chunks.
I’m also happy to discuss this offline if that would be easier.

Summary​

Motivation​

1. Partial (Unfull) Chunk Allocation​

Example​

2. Compression-Enabled Workloads​

Example​

Proposed Change​

Expected Benefits​

Reduced Fragmentation​

Improved Efficiency​

Comments​

Summary

Motivation

1. Partial (Unfull) Chunk Allocation

Example

2. Compression-Enabled Workloads

Example

Proposed Change

Expected Benefits

Reduced Fragmentation

Improved Efficiency

Comments